1) Defining how we assess performance

What do we mean by "loss"?

Screenshot taken from Coursera 1:00

How do we formalize this notion of how much we're losing? And in machine learning, we do this by defining something called a loss function.

And what the loss function specifies is the cost incurred when the true observation is y, and I make some other prediction. So, a bit more explicitly, what we're gonna do is estimate our model parameters, and those are $\hat w$. We're gonna use those to form predictions.

  • $f_{\hat w}(x) = \hat f(x)$: our predicted value at some input x.

The loss function L measures the difference between these two things: the true value and our predicted value.

There are a couple of ways in which we could define a loss function. Very common choices include something called absolute error, which just looks at the absolute value of the difference between your true value and your predicted value. Another common choice is something called squared error, where, instead of just looking at the absolute value, you look at the square of that difference. And so that means that you have a very high cost if that difference is large, relative to just absolute error.
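To make these two choices concrete, here's a minimal Python/NumPy sketch (not from the course; the function names are my own):

```python
import numpy as np

def absolute_error(y_true, y_pred):
    # L(y, f_w_hat(x)) = |y - f_w_hat(x)|
    return np.abs(y_true - y_pred)

def squared_error(y_true, y_pred):
    # L(y, f_w_hat(x)) = (y - f_w_hat(x))^2; large differences cost much more than under absolute error
    return (y_true - y_pred) ** 2

# Example: a house that sold for $500,000 but was predicted at $450,000
print(absolute_error(500_000, 450_000))  # 50000
print(squared_error(500_000, 450_000))   # 2500000000
```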

Screenshot taken from Coursera 3:30

2) 3 measures of loss and their trends with model complexity

1) Training error: assessing loss on the training set

The first measure of error of our predictions that we can look at is something called training error. And we discussed this at a high level in the first course of the specialization, but now let's go through it in a little bit more detail.

So, to define training error, we first have to define training data. Typically you have some dataset, which I've shown you here as these blue circles, and we're going to choose as our training dataset just some subset of these points. So, the greyed circles are ones that are not included in the training set; the blue circles are the ones that we're keeping in this training set. And then we take our training data and, as we've discussed in previous modules of this course, we use it to fit our model, to estimate our model parameters. Just as an example, with this dataset here, maybe we choose to fit some quadratic function to the data, and like we've talked about, in order to fit this quadratic function we're gonna minimize the residual sum of squares on these training data points.

Screenshot taken from Coursera 1:00

So, now we have our estimated model parameters, $\hat w$. And we want to assess the training error of that estimated model. The way we do that is first we need to define some loss function. So, maybe we use squared error or absolute error.

And then training error is defined simply as the average loss over the training points. So, mathematically, this is simply: $$\dfrac{1}{N} \sum_{i=1}^N L(y_i, f_{\hat w}(x_i))$$

  • N: the total number of observations in my training set

And just to be very clear, the estimated parameters were estimated on the training set: they minimize the residual sum of squares on the very same training points that we're now using to define this training error.

Screenshot taken from Coursera 2:00

So, we can go through this pictorially in the following example, where in this case we're specifically looking at using squared error as our loss function. And in this case, our training error is simply $\dfrac{1}{N}$ times the sum of the squared differences between our actual house sales prices and our predicted house sales prices, where that sum is taken over all houses in our training data set. And what we see is that in this case, where we choose squared error as our loss function, the form of training error is exactly $\dfrac{1}{N}$ times our residual sum of squares. So, just be aware of that when you're computing training error and reporting these numbers; here we're defining it as the average loss.

Screenshot taken from Coursera 3:00

More formally, we can write our training error as follows, and then we can define something that's commonly referred to as RMSE, whose full name is root mean square error. And RMSE is simply the square root of our average loss on the training houses, so the square root of our training error. And the reason one might consider looking at root mean square error is because its units, in this case, are just dollars, whereas for our training error the units were dollars squared.
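As a small sketch of the relationship between training error and RMSE under squared-error loss (assuming NumPy arrays `y_train` of true prices and `y_pred` of predictions from the fitted model; the names are my own):

```python
import numpy as np

def training_error(y_train, y_pred):
    # average squared loss over the N training points: (1/N) * sum_i (y_i - f_w_hat(x_i))^2,
    # i.e. (1/N) times the residual sum of squares
    return np.mean((y_train - y_pred) ** 2)

def rmse(y_train, y_pred):
    # root mean squared error: the same quantity, but back in dollars instead of dollars squared
    return np.sqrt(training_error(y_train, y_pred))

# Toy usage
y_train = np.array([510_000.0, 620_000.0, 480_000.0])
y_pred  = np.array([500_000.0, 600_000.0, 495_000.0])
print(training_error(y_train, y_pred), rmse(y_train, y_pred))
```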

Screenshot taken from Coursera 3:39

Now, that we've defined training error, we can look at how training error behaves as model complexity increases. So, to start with let's look at the simplest possible model you might fit, which is just a constant model. So this is the simplest model we're gonna consider, or could consider, and you see that there is pretty significant training error.

Then let's say I fit a linear model. Well, a line; these are all linear models we're looking at, since it's linear regression, but here we're just fitting a line to the data. And you see that my training error has gone down.

Then I fit a quadratic function, and again training error goes down. And what I see is that as I increase my model complexity to maybe this higher-order polynomial, I have very low training error, just this one small pink bar here. So, training error decreases quite significantly with model complexity.

So, there's a decrease in training error as you increase your model complexity. And why is that? Well, it's pretty intuitive: the model was fit on the training points, and then I'm asking how well it fits them. As I increase the model complexity, I'm better and better able to fit my training data points. So, then when I go to assess my training error with these high-complexity models, I have very low training error.
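To see this trend numerically, here's a hedged sketch on synthetic data (the true function, noise level, and degrees are arbitrary choices of mine, not the course's): training error keeps dropping as the polynomial degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=30)                    # inputs (think: rescaled square feet)
y = np.sin(2 * x) + rng.normal(0, 0.2, size=30)   # some "true" relationship plus noise

for degree in [0, 1, 2, 6, 10]:
    coeffs = np.polyfit(x, y, deg=degree)         # least squares fit: minimizes RSS on training data
    y_pred = np.polyval(coeffs, x)
    train_err = np.mean((y - y_pred) ** 2)        # training error = average squared loss
    print(f"degree {degree:2d}: training error = {train_err:.4f}")
```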

Screenshot taken from Coursera 5:00

So, a natural question is whether training error is a good measure of predictive performance. And what we're showing here is one of our high-complexity, high-order polynomial models that had very low training error, so it really fit those training data points well. But how's it gonna perform on some new house?

Screenshot taken from Coursera 6:00

So, in particular, maybe we're looking at a house in this gray region, so with this range of square feet. The question is, is there something particularly wrong with having $x_t$ square feet? Because what our fitted function is saying is that I believe, or I'm predicting, that houses with roughly $x_t$ square feet are less valuable than houses with fewer square feet, because there's this dip down in the function. Do we really believe that this is a true dip in value, that these houses are just less desirable than houses with fewer or more square feet? Probably not. So, what's going wrong here?

Screenshot taken from Coursera 6:45

The issue is the fact that training error is overly optimistic when we're going to assess predictive performance. And that's because these parameters, $\hat w$, were fit on the training data: they were fit to minimize residual sum of squares, which is closely related to training error. And then we're using training error to assess predictive performance, but that's gonna be very, very optimistic, as this picture shows. So, in general, having small training error does not imply having good predictive performance, unless your training data set is really representative of everything that you might see out there in the world.

Screenshot taken from Coursera 7:30

2) Generalization error: what we really want

So, instead of using training error to assess our predictive performance, what we'd really like to do is analyze something that's called generalization or true error. In particular, we really want an estimate of what the loss is, averaged over all houses that we might ever see in our neighborhood. But in our dataset we only have a few examples of houses that were sold. There are lots of other houses in our neighborhood that we don't have in our dataset, or other houses that you might imagine having been sold.

Screenshot taken from Coursera 0:30

Okay, so to compute this estimate over all houses that we might ever see, we'd like to weight these house pairs, the pair of house attributes and house sale price, by how likely that pair is to occur. So to do this we can think about defining a distribution, in this case over square feet of houses in our neighborhood.

What this picture is showing is a distribution that says we're very unlikely to see houses with very small or low number of square feet, very small houses. And we're also very unlikely to see really, really massive houses. So there's some bell curve to this, there's some sweet spot of kind of typical houses in our neighborhood, and then the likelihood drops off from there.

Screenshot taken from Coursera 1:30

Likewise, what we can do is define a distribution that says, for a given square footage of a house, what's the distribution over the sales price of that house? So let's say the house has 2,640 square feet. Maybe I expect the range of house prices to be somewhere between 680,000 and maybe 950,000. That might be a typical range. But of course, you might see much lower or higher valued houses, depending on the quality of that house.

Screenshot taken from Coursera 1:39

Formally when we go to define our generalization error, we're saying that we're taking the average value of our loss weighted by how likely those pairs were in our dataset.

So specifically, we estimate our model parameters on our training data set, so that's what gives us $\hat w$. That defines the model we're using for prediction, and then we have our loss function, assessing the cost of predicting $f_{\hat w}$ at our square feet x when the true value was y. And then what we're gonna do is average over all possible (x,y), weighted by how likely they are according to those distributions over square feet and value given square feet.

Screenshot taken from Coursera 3:00

Let's go back to these plots of looking at error versus model complexity. But in this case let's quantify our generalization error as a function of this complexity.

And to do this, what I'm showing by this blue shaded region here, which has a gradation going from white to darker blue, is the distribution of houses that I'm likely to see in my dataset. So, this white region here, these are the houses that I'm very likely to see, and then as I go further away from the white region I get to less likely house sale prices given a specific square foot value.

And so what I'm gonna do when thinking about generalization error is take my fitted function, where remember this green line was fit on the training data, which are these blue circles. And then I'm gonna say, how well does it predict houses in this shaded blue region, weighted by how likely they are, how close they are to that white region?

Okay, so what I see here is that this constant model really doesn't approximate things well, except maybe in this region here. So overall it has a reasonably high generalization error, and I can go to my more complex model.

Screenshot taken from Coursera 5:00

Then I get to this much higher order polynomial, and when we were looking at training error, the training error was lower, right? But now, when we think about generalization error, we actually see that the generalization error is gonna go up relative to the simpler model.

Screenshot taken from Coursera 6:50

So our generalization error in general will have some shape where it's going down, and then we get to a point where the error starts increasing. The error starts increasing because we're getting to these overly complex models that fit the training data really well but don't generalize to other houses that we might see.

But importantly, in contrast to training error, we can't actually compute generalization error, because everything was relative to this true distribution, the true way in which the world works: how likely houses are to appear, over all possible square feet and all possible house values. And of course, we don't know what that is. So, this is our ideal picture, or our cartoon, of what would happen, but we can't actually go along and compute these different points.

Screenshot taken from Coursera 8:00

3) Test error: what we can actually compute

So we can't compute generalization error, but we want some better measure of our predictive performance than training error gives us. And so this takes us to something called test error, and what test error is going to allow us to do is approximate generalization error.

And the way we're gonna do this is by approximating the error, looking at houses that aren't in our training set.

Screenshot taken from Coursera 1:00

So instead of including all these colored houses in our training set, we're gonna shade out some of them, these shaded gray houses and we're gonna make these into what's called a test set.

Screenshot taken from Coursera 1:15

And when we go to fit our models, we're just going to fit our models on the training data set. But then when we go to assess our performance of that model, we can look at these test houses, and these are hopefully going to serve as a proxy of everything out there in the world. So hopefully, our test data set is a good measure of other houses that we might see, or at least in order to think of how well a given model is performing.

Screenshot taken from Coursera 1:25

So test error is gonna be our average loss computed over the houses in our test data set.

  • $N_{test}$: the number of houses in our test data set
  • $\hat w$: very important, estimated parameters were fit on the training data set

Okay, so even though this function looks very much like training error, the sum is over the test houses, but the function we're looking at was fit on training data. Okay, so these parameters in this fitted function never saw the test data.
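A minimal sketch of that distinction, on synthetic data of my own choosing: the parameters are fit on the training points only, and the error is then averaged over the held-out test points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=100)
y = np.sin(2 * x) + rng.normal(0, 0.2, size=100)

# Carve out 20 of the 100 points as a test set that the fit never sees
is_test = np.zeros(100, dtype=bool)
is_test[rng.choice(100, size=20, replace=False)] = True

coeffs = np.polyfit(x[~is_test], y[~is_test], deg=2)     # w_hat: fit on training points only
test_pred = np.polyval(coeffs, x[is_test])

# Test error: average squared loss over the N_test held-out houses
test_error = np.mean((y[is_test] - test_pred) ** 2)
print(f"test error = {test_error:.4f}")
```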

Screenshot taken from Coursera 2:20

So just to illustrate this, we might think of fitting a quadratic function through this data, where we're gonna minimize the residual sum of squares on the training points, those blue circles, to get our estimated parameters $\hat w$.

Screenshot taken from Coursera 2:33

Then when we go to compute our test error, which in this case again we're gonna use squared error as an example, we're computing this error over the test points, all these grey circles here. So test error is $\dfrac{1}{N_{test}}$ times the sum of the squared differences between our true house sales prices and our predicted prices, summing over all houses in our test data set.

Screenshot taken from Coursera 2:45

Let's summarize our measures of error as a function of model complexity

  • Our training error decreased with increasing model complexity.
  • In contrast, our generalization error went down for some period of time, but then we started getting to overly complex models that didn't generalize well, and the generalization error started increasing. So here we have generalization error, or true error.
  • Our test error is a noisy approximation of generalization error. Because if our test data set included everything we might ever see in the world, in proportion to how likely it was to be seen, then that would be exactly our generalization error. But of course, our test data set is just some finite data set, and we're using it to approximate generalization error, so it's gonna be some noisy version of this curve here.

Test error is the thing that we can actually compute. Generalization error is the thing that we really want.

Screenshot taken from Coursera 3:00

4) Defining overfitting

The notion of overfitting is this: you have a model with estimated parameters $\hat w$, and there exist some other estimated parameters, which I'll just call $w'$.

The model is overfit if two conditions hold:

  • training error ($\hat w$) < training error ($w'$).
  • true error ($\hat w$) > true error ($w'$).

Generally, the models that are overfit are the ones that have smaller training error: these are the ones that are really highly fit to the training data set but don't generalize well. Whereas the other points, in the other half of this space, are the ones that are not really well fit to the training data and also don't generalize well.

Screenshot taken from Coursera 2:00

5) Training/test split

So we've said to assess the performance of our model, we really need to have a test data set carved out from our full data set. So, this raises the question of, how do I think about dividing the data set into training data versus test data?

  • If I put too few points in my training set, then I'm not going to estimate my model well. And so, I'm going to have clearly bad predictive performance because of that.
  • If I put too few points in my test set, that's gonna be a bad approximation to generalization error.

A general rule of thumb is that you typically want just enough points in your test set to approximate generalization error well, and you want all the remaining points in your training data set, because you want as many points as possible in your training data set to learn a good model.
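A hedged sketch of such a split (the helper name and the 20% default are my own choices, not a course API):

```python
import numpy as np

def train_test_split(x, y, test_fraction=0.2, seed=0):
    # Shuffle the indices, carve off just enough points to approximate
    # generalization error on the test side, and keep the rest for training.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

# Toy usage
x = np.arange(10.0)
y = 2 * x + 1
x_train, y_train, x_test, y_test = train_test_split(x, y, test_fraction=0.2)
print(len(x_train), len(x_test))   # 8 2
```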

Screenshot taken from Coursera 1:00

3) 3 sources of error and the bias-variance tradeoff

1) Irreducible error and bias

We've talked about three different measures of error. And now in this part, we're gonna talk about three different sources of error. And this is gonna lead us into a conversation about the bias-variance tradeoff. Okay, so when we form our prediction, there are three different sources of error:

  • Noise
  • Bias
  • Variance

Screenshot taken from Coursera 0:30

Let's look at the noise term

As we've mentioned many times in this specialization, data are inherently noisy.

So the way the world works is that there's some true relationship between square feet and the value of a house. Or generically, between x and y. And we're representing that arbitrary relationship defined by the world, by $f_{w(true)}$, which is the notation we're using for that functional relationship.

But of course that's not a perfect description of the relationship between x and y, the number of square feet and the house value. There are lots of other contributing factors, including other attributes of the house that are not included in just square feet, or how a person feels when they go in and make a purchase of a house, or a personal relationship they might have with the owners. Or lots and lots of other things that we can't ever perfectly capture with just some function between square feet and value, and so that is the noise that's inherent in this process, represented by this epsilon term ($\epsilon$). So in particular, any observation $y_i$ is the sum of this relationship between the square feet and the value plus this noise term $\epsilon_i$ specific to that $i$th house.

And we've talked before about our assumption that this noise has zero mean, because if it didn't, that could be shoved into the f function instead. But what we haven't talked about is the spread of that noise. So at any given square footage, what kind of variation in house price are we likely to see, based on this type of noise that's inherent in our observations? This is referred to as the variance of this noise term epsilon. And this is something that's just a property of the data. We don't have control over this. This has nothing to do with our model nor our estimation procedure, it's just something that we have to deal with. And so this is called irreducible error, because it's nothing that we can reduce through choosing a better model or a better estimation procedure.
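As a small simulation sketch of this data-generating story (the particular f_true and sigma are invented for illustration): every observation is the true relationship plus zero-mean noise whose variance we cannot reduce.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(sqft):
    # stand-in for the unknown true relationship between square feet and value
    return 200_000 + 150 * sqft

sigma = 50_000                                  # std. dev. of the irreducible noise (variance sigma^2)
sqft = rng.uniform(500, 4000, size=10)          # square feet of 10 observed houses
epsilon = rng.normal(0, sigma, size=10)         # zero-mean noise, one draw per house
y = f_true(sqft) + epsilon                      # observed prices: y_i = f_true(x_i) + epsilon_i
print(np.round(y, -3))
```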

Screenshot taken from Coursera 2:45

The things that we can control are bias and variance, so we're gonna focus quite heavily on those two terms. So let's start by talking about bias. And this is basically just an assessment of how well my model can fit the true relationship between x and y.

So to think about this, let's think about how we get data in our data set. Here, these points that we observed are just a random snapshot of N houses that were sold, recorded, and tabulated in our data set. Well, based on that data set, we fit some function, and thinking about bias, it's intuitive to start with a very simple model, just a constant function. But what if another set of N houses had been sold? Then we would have had a different data set that we were using. And when we went to fit our model, we would have gotten a different line.

In the first data set, I tended to draw points that were below the true relationship, so our houses in that data set happened to have values less than what the world kind of specifies as typical. And on the right hand side, I drew points that tended to lie above the line. So these are pretty dramatically different data sets, but what you see is that the fits are pretty similar.

Screenshot taken from Coursera 4:00

So what we are saying is, over all possible data sets of size N that we might have been presented with of house sales, what do we expect our fit to look like?

There's a continuum of possible fits we might have gotten. And for all those possible fits, this dashed green line represents our average fit, averaged over all those fits weighted by how likely they were to have appeared.

Screenshot taken from Coursera 5:00

Now we can start talking about bias. What bias is, is it's the difference between this average fit and the true function, $f_{w(true)}$.

That's what this equation shows here, and we're seeing this with this gray shaded region. That's the difference between the true function and our average fit. And so intuitively what bias is saying is, is our model flexible enough to on average be able to capture the true relationship between square feet and house value. And what we see is that for this very simple constant model, this low complexity model has high bias. It's not flexible enough to have a good approximation to the true relationship. And because of these differences, because of this bias, this leads to errors in our prediction.

Screenshot taken from Coursera 6:15

2) Variance and the bias-variance tradeoff

Let's turn to this third component, which is variance.

And what variance is gonna say is, how different can my specific fits to a given data set be from one another, as I'm looking at different possible data sets? And in this case, when we are looking at just this constant model, we showed with that earlier picture, where I drew points that were mainly above the true relationship and points mainly below, that the actual resulting fits didn't vary very much. And when you look at the space of all possible observations, you see that the fits are fairly similar, they're fairly stable.

Screenshot taken from Coursera 0:30

When you look at the variation in these fits, which I'm drawing with these grey bars here, we see that they don't vary very much.

Screenshot taken from Coursera 0:54

So, for this low complexity model, we see that there's low variance. So, to summarize what this variance is saying is, how much can the fits vary? And if they could vary dramatically from one data set to the other, then you would have very erratic predictions. Your prediction would just be sensitive to what data set you got. So, that would be a source of error in your predictions.

Screenshot taken from Coursera 1:10

And to see this, we can start looking at high-complexity models. So in particular, let's look at this data set again. And now, let's fit some high-order polynomial to it.

Then let's choose two points in this data set, which I'm gonna highlight as these pink circles, and let's just move them a little bit. So, out of this whole data set, I've just moved two observations, and not too dramatically, but I get a dramatically different fit.

Screenshot taken from Coursera 1:20

So then, when I think about looking over all possible data sets I might get, I might get some crazy set of curves. There is an average curve, and in this case, the average curve is actually pretty well behaved, because this wild, wiggly curve is, at any point, equally likely to have been wild above or wild below. So, on average over all data sets, it's actually a fairly smooth, reasonable curve. But if I look at the variation between these fits, it's really large. So, what we're saying is that high-complexity models have high variance.

Screenshot taken from Coursera 2:30

On the other hand, if I look at the bias of this model, here again I'm showing this average fit, which was this fairly well-behaved curve, and it matched pretty well to the true relationship between square feet and house value, because my model is really flexible. So on average, it was able to fit that true relationship pretty precisely. So, these high-complexity models have low bias.

Screenshot taken from Coursera 3:00

We can now talk about this bias-variance tradeoff. So, in particular, we're gonna plot bias and variance as a function of model complexity.

  • As model complexity increases, our bias decreases.
  • As model complexity increases, variance increases. So, our very simple model had very low variance, and the high-complexity models had high variance.

What we see is that there's this natural tradeoff between bias and variance. And one way to summarize this is something that's called mean squared error.

MSE = bias$^2$ + variance

Machine learning is all about this tradeoff between bias and variance. And the goal is finding this sweet spot. This is the sweet spot where we get our minimum error, the minimum contribution of bias and variance, to our prediction errors.

But just like with generalization error, we cannot compute bias, variance, or mean squared error exactly. The reason is that, just like with generalization error, they were defined in terms of the true function: bias was defined very explicitly relative to the true function. And when we think about defining variance, we have to average over all possible data sets of size N that we could have gotten from the world, and the same was true for bias too, and we just don't know what that distribution is. So, we can't compute these things exactly. But throughout the rest of this course, we're gonna look at ways to optimize this tradeoff between bias and variance in a practical way.
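We can't compute these quantities on real data, but on synthetic data where we pick the true function ourselves we can approximate them by simulation. A hedged sketch (the true function, noise level, and model degree are all my own choices): draw many training sets, fit the same model to each, and look at the spread of the fits at a target point $x_t$.

```python
import numpy as np

rng = np.random.default_rng(3)
f_true = lambda x: np.sin(2 * x)          # "true" relationship, known only because the data is synthetic
sigma, N, degree, x_t = 0.3, 30, 2, 1.5

fits_at_xt = []
for _ in range(2000):
    # One possible training set of size N
    x = rng.uniform(0, 3, size=N)
    y = f_true(x) + rng.normal(0, sigma, size=N)
    coeffs = np.polyfit(x, y, deg=degree)             # fit the same model complexity each time
    fits_at_xt.append(np.polyval(coeffs, x_t))        # record this fit's prediction at x_t

fits_at_xt = np.array(fits_at_xt)
avg_fit = fits_at_xt.mean()                           # approximates the average fit at x_t
bias_sq = (f_true(x_t) - avg_fit) ** 2                # bias^2 at x_t
variance = fits_at_xt.var()                           # variance of the fits at x_t
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, MSE = {bias_sq + variance:.4f}")
```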

Screenshot taken from Coursera 6:00

3) Error vs. amount of data

Let's start with looking at our true error or generalization error. But first, I want to make sure it's clear that we are looking at these errors for a fixed model complexity.

If we have very few data points, our fitted function is a pretty poor estimate of the true relationship between x and y. So our true error's gonna be pretty high; let's say that $\hat w$ is not approximated well from few points. But as we get more and more data, we get a better and better approximation of our model, and our true error decreases. But it decreases to some limit.

And what is that limit? Well, that limit is the bias plus the noise inherent in the data. Because as we get tons and tons of observations, we're taking our model and fitting it as well as we could ever hope to fit it, because we have every observation out there in the world. But the model might just not be flexible enough to capture the true relationship between x and y, and that is our notion of bias. Plus, of course, there's the error just from the noise in the observations, that other contribution. Okay, so this gap here is the bias of the model plus the noise of the data.

Now let's look at training error. So let's say our training error starts somewhere. What ends up happening is that training error goes up as you get more and more data points. With few data points, a fixed-complexity model can fit them reasonably well, where "reasonably" of course depends on what the complexity of the model is. But as I get more and more and more data points, that same complexity of model can't hope to fit all these points perfectly well. What is the limit of training error? That limit is exactly the same as the limit of our true error.

The reason is that I have tons and tons of points there; that's all the points there could ever possibly be in the world, and I fit my model to it. And if I measure training error, I'm computing it over all the possible points there are out there in the world. And that's exactly our definition of true error. So they converge to exactly the same point in the limit, where that gap, again, is the bias inherent from the lack of flexibility of the model, plus the noise inherent in the data.

So just to write this down:

  • In the limit, as I get lots and lots of data points, this curve is gonna flatten out to how well the model can fit the true relationship $f_{w(true)}$.
  • In the limit, true error = training error.
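A small simulation sketch of these curves (again with a synthetic true function of my choosing, and a large held-out sample standing in for "all houses in the world"): for a fixed model complexity, training error rises with N while the true-error proxy falls, and they approach a common limit.

```python
import numpy as np

rng = np.random.default_rng(4)
f_true = lambda x: np.sin(2 * x)
sigma, degree = 0.3, 2

# A large held-out sample used as a stand-in for the true error
x_big = rng.uniform(0, 3, size=100_000)
y_big = f_true(x_big) + rng.normal(0, sigma, size=100_000)

for N in [5, 10, 50, 500, 5000]:
    x = rng.uniform(0, 3, size=N)
    y = f_true(x) + rng.normal(0, sigma, size=N)
    coeffs = np.polyfit(x, y, deg=degree)                          # fixed complexity, growing N
    train_err = np.mean((y - np.polyval(coeffs, x)) ** 2)
    true_err_proxy = np.mean((y_big - np.polyval(coeffs, x_big)) ** 2)
    print(f"N={N:5d}: training error={train_err:.4f}, true error (approx.)={true_err_proxy:.4f}")
```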

So what we've seen so far in this module are three different measures of error: our training error, our true or generalization error, as well as our test error approximation of generalization error. And we've seen three different contributions to our errors: thinking about that inherent noise in the data, and then thinking about this notion of bias and variance. And we finally concluded with this discussion on the tradeoff between bias and variance, and how bias appears no matter how much data we have. We can't escape the bias from having a specified model of a given complexity.

Screenshot taken from Coursera 5:00

4) Formally defining and deriving the 3 sources of error

1) Formally defining the 3 sources of error

So we mentioned that the training set is just a random sample of some N observations, in this case, some N houses that were sold and recorded. But what if N other houses had been sold and recorded? How would our performance change? So for example, here in this picture we're showing one set of N observations that are used as training data, those are the blue circles, and we fit some quadratic function through this data. And here we show some other set of N observations, and we see that we get a different fit.

Screenshot taken from Coursera 1:00

And to assess our performance of each one of these fits we can think about looking at generalization error.

  • So in the first case we might get one generalization error of this specific fit $\hat w(1)$.
  • And in the second case we would get some different evaluation of generalization error. Let's call it generalization error of $\hat w(2)$.

Screenshot taken from Coursera 1:30

But one thing that we might be interested in is, how do we perform on average for a training data set of N observations?

Because imagine trying to develop a tool that's gonna be used by real estate agents to form these types of predictions. Well, I'd like to design my tool, package it up and send it out there, and then a real estate agent might come in with some set of observations of house sales from their neighborhood that they're using to make their predictions. And that might be different from another real estate agent's data.

And what I'd like to know, is for a given amount of data, some training set of size N, how well should I expect the performance of this model to be, regardless of what specific training dataset I'm looking at? So in these cases what we like to do is average our performance over all possible fits that we might get. What I mean by that is all possible training data sets that might have appeared, and the resulting fits on those data sets.

Screenshot taken from Coursera 1:50

So formally, we're gonna define this thing called expected prediction error, which is the expected value of our generalization error over different training data sets. So very specifically, for a given training data set, we get parameters that are fit to that data set; I'll call those $\hat w$ of training set. And then for that estimated model, I can evaluate my generalization error. And what the expected prediction error is doing is taking a weighted average over all possible training sets that I might have seen, where for each one I get a different set of estimated parameters and thus a different notion of the generalization error.

Screenshot taken from Coursera 3:00

And to start analyzing this quantity of prediction error, let's specifically look at some target input $x_t$, which might be a house with 2,640 square feet. And let's also take our loss function to be squared error. So in this case we're talking specifically about a target point $x_t$. What we can do later, after we do the analysis specifically for $x_t$, is think about averaging this over all possible $x_t$, over all square feet. But in some cases we might actually be interested in one region of our input space in particular. And then using squared error in particular is gonna allow our analysis to follow through really nicely, as we're gonna show not in this video, but in our next, even more in-depth video, which is also optional.

Screenshot taken from Coursera 4:00

But under these assumptions of looking specifically at $x_t$ and looking at squared error as our measure of loss, you can show that the average prediction error at $x_t$ is simply the sum of three terms, which we're gonna go through: $\sigma^2$ (sigma squared), bias squared, and variance.
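Writing this decomposition out as a single equation (the same statement that gets derived formally later in this section):

$$\text{Expected prediction error at } x_t = \sigma^2 + \big[\text{bias}(x_t)\big]^2 + \text{var}(x_t)$$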

So these terms are yet to be defined, and this is what we're gonna walk through in this video in a much more formal way than we did in the previous set of slides.

Screenshot taken from Coursera 4:35

So let's start by talking about this first term, sigma squared and what this is gonna represent is the noise we talked about in the earlier videos.

So in particular, remember that we're saying that there's some true relationship between square feet and house value. That that's just a relationship that exists out there in the world, and that's captured by $f_{w(true)}$, but of course that doesn't fully capture how we think about the value of a house. There are other factors at play. And so all those other factors out there in the world are captured by our noise term, which here we write as just an additive term plus epsilon.

So epsilon is our noise, and we said that this noise term has zero mean, because if not we could just shove that other component into $f_{w(true)}$. But once we make the assumption that epsilon has zero mean, we can start talking about the spread of noise you're likely to see at any point in the input space. And that spread is called the variance. So we denote it by sigma squared, and sigma squared is the variance of this noise epsilon.

And as we talked about before, this noise is just noise that's out there in the world; we have no control over it, no matter how complicated and interesting a model we specify, or what algorithm we use for fitting that model. We can't do anything about the fact that we're using x alone for our prediction. There's just inherently some noise in how our observations are generated in the world. So for this reason, this is called our irreducible error, because it's noise that we can't reduce through any choices that we have control over.

Screenshot taken from Coursera 5:50

So now let's talk about this second term, bias squared.

And remember that when we talked about bias, this was a notion of how well our model could, on average, fit the true relationship between x and y. But now let's go through this at a much more formal level. In particular, let's just remember that there's some relationship between square feet and house value, in our case represented by this orange line. And then from this true world we get some data set that defines a training set, which are these blue circles. And using this training data we estimate our model parameters. Well, if we had gotten some other set of N points, we would have fit some other function.

Screenshot taken from Coursera 7:00

Now, I look over all possible data sets of size N that I might have gotten, where this blue shaded region here represents the distribution over x and y, so how likely it is to get different combinations of x and y. And let's say I draw N points from this joint distribution over x and y, and for each possible draw I look at an estimated function. So for example, here are the two estimated functions from the previous slide, those example data sets that I showed. But of course there's a whole continuum of estimated functions that I get for different training sets of size N. Then when I average these estimated functions, these specific fits, over all my possible training data sets, what I get is my average fit. So now let's talk about this a little bit more formally. We had already presented this in our previous video.

This is $f_{\bar w}$ (f sub w bar). But now, let's define it: it's the expectation of the fit I get on a specific training data set, averaged over all possible training data sets of size N that I might get. So that is the formal definition of $f_{\bar w}$, what we have been calling our average fit.

And what we're talking about when we're talking about bias is comparing this average fit to the true relationship. And here, remember again, we're focusing specifically on some target $x_t$. And so the bias at $x_t$ is the difference between the true relationship at $x_t$, so between a given square footage and the house value, whatever the true relationship is between that input and the observation, and this average fit estimated over all possible training data sets.
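Written out explicitly, these two definitions are:

$$f_{\bar w}(x_t) = E_{train}\big[f_{\hat w(train)}(x_t)\big], \qquad \text{bias}(x_t) = f_{w(true)}(x_t) - f_{\bar w}(x_t)$$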

Screenshot taken from Coursera 9:00

So that is the formal notion of bias of $x_t$, and let's just remember that when it comes in as our error term, we're looking at bias squared.

Screenshot taken from Coursera 9:25

So that's the second term. Now let's turn to this third term which is variance.

And let's go through this definition where again, we're interested in this average fit $f_{\bar w}$ (f sub w bar), this green dashed line. But that really isn't the quantity of interest. It's gonna be used in our definition here. But the thing that we're really interested in, is over all possible fits we might see. How much do they deviate from this expected fit?

Screenshot taken from Coursera 10:00

So thinking about again, specifically at our target $x_t$, how much variation is there in the training dataset specific fits across all training datasets we might see?

Screenshot taken from Coursera 10:15

And that's this variance term and now again, let's define it very formally.

Well, let me first state what variance is in general. The variance of some random variable is simply the expected value of that random variable minus its mean, squared. So in this context, when we're looking at the variability of these functions at $x_t$, we're taking an expectation, and our random quantity is our estimated function for a specific training data set, evaluated at $x_t$.

And then what's the mean of that random function? The mean is this average fit, this $f_{\bar w}$ (f sub w bar). So we're looking at the difference between the fit on a specific training dataset and what I expect to get averaged over all possible training datasets. I look at that quantity squared, and what is my expectation taken over?

Let me just mention that this quantity, when I take it squared, represents a notion of how much deviation a specific fit has from the expected fit at $x_t$. And then, when I think about what the expectation is taken over, it's taken over all possible training data sets of size N. So that's my variance term.
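So, written out explicitly:

$$\text{var}(x_t) = E_{train}\Big[\big(f_{\hat w(train)}(x_t) - f_{\bar w}(x_t)\big)^2\Big]$$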

And when we think intuitively about why it makes sense that we have the sum of these three terms in this specific form: well, what we're saying is that variance is telling us how much the specific function I'm using for prediction can vary. I'm just gonna use one of these functions for prediction: I get a training dataset that gives me an $f_{\hat w}$ (f sub w hat), and I'm using that for prediction. Well, how much can that deviate from my expected fit over all datasets I might have seen?

So again, going back to our analogy, I'm a real estate agent, I grab my data set, I fit a specific function to that training data. And I wanna know well, how wild of a fit could this be relative to what I might have seen on average over all possible datasets that all these other realtors are using out there?

And so of course, if the function from one realtor to another realtor, looking at different data sets, can vary dramatically, that can be a source of error in our predictions. But another source of error, which the bias is capturing, is that over all these possible datasets, all these possible realtors, if this average function can never capture anything close to the true relationship between square feet and house value, then we can't hope to get good predictions either. And that's what our bias is capturing. And why are we looking at bias squared? Well, that's putting it on an equal footing with these variance terms, because remember, bias was just the difference between the true value and our expected value, but these variance terms are looking at these types of quantities squared. So that's intuitively why we get bias squared. And then finally, what's our third source of error?

Well, let's say I have no variance in my estimator, or always very low variance, and the model happens to be a very good fit, so neither of these things are sources of error; I'm doing basically magically perfectly on my modeling side. Still, inherently there's noise in the data. There are things that predictions from square feet alone just can't capture, and so that's where irreducible error, or this sigma squared, is coming through. And so intuitively this is why our prediction errors are a sum of these three different terms, which now we've defined much more formally.

Screenshot taken from Coursera 12:00

2) Formally deriving the 3 sources of error

Let's now derive why specifically these are the three sources of error, and why they appear as sigma squared plus bias squared plus variance.

Let's start by recalling our definition of expected prediction error, which was the expectation over training data sets of our generalization error. And here I'm using just a shorthand notation, train, instead of training set. (train = training set)

So let's plug in the formal definition of our generalization error. And remember that our generalization error was our expectation over all possible input and output pairs, (x, y) pairs, of our loss. And so that's what is written here on the second line. And then let's remember that we talked about specifying things specifically at a target $x_t$, and under an assumption of using a loss function of squared error. And so again, we're gonna use this to form all of our derivations. When we make these two assumptions, this expected prediction error at $x_t$ simplifies to the following, where there's no longer an expectation over x because we're fixing our point in the input space to be $x_t$, and our expectation over y becomes an expectation over $y_t$ because we're only interested in the observations that appear for an input at $x_t$. The other thing that we've done in this equation is we've plugged in our specific definition of our loss function, the squared error loss. So, for the remainder of this video, we're gonna start with this equation and derive why we get this specific form, sigma squared plus bias squared plus variance.

Screenshot taken from Coursera 2:00

Expected prediction error at $x_t$ $$= \large E_{train,y_t}[(y_t - f_{\hat w(train)}(x_t))^2]$$

So this is the definition of expected prediction error at $x_t$ that we had on the previous slide, under our assumption of squared error loss. What we can do is rewrite this equation as follows, where we've simply added and subtracted the true function, the true relationship between x and y, specifically at $x_t$. And because we've simply added and subtracted the same quantity, nothing in this equation has changed as a result. $$= \large E_{train,y_t}[((y_t - f_{w(true)}(x_t)) + (f_{w(true)}(x_t) - f_{\hat w(train)}(x_t)))^2]$$

Let's do a little aside here, because it is useful. So if we take the expectation of some quantity: $$ \large E[(a + b)^2] \\ = E[a^2 + 2ab + b^2] \\ = E[a^2] + 2E[ab] + E[b^2]$$

I'm going to define some shorthand for writing purposes:

  • $y_t$: y
  • $f_{w(true)}$: f
  • $f_{\hat w(train)}$: $\hat f$

Now that we've set the stage for this derivation, let's rewrite this term. We get the expectation over our training data set and our observation (remember, I'm writing $y_t$ just as y) of the first term squared, $(y - f)^2$; that's my $a^2$ term, this first term here. Then I get two times the expectation, again over the training data set and the observation y, of a times b, which is $(y - f)(f - \hat f)$. And then the final term is the expectation, over my training set and the observation y, of $b^2$, which is $(f - \hat f)^2$. $$= \large E_{train,y}[(y-f)^2] +2E_{train,y}[(y - f)(f- \hat f)] + E_{train,y}[(f- \hat f)^2]$$

Now let's simplify this a bit.

Does anything in this first term depend on my training set? Well, y is not a function of the training data, and f is not a function of the training data; that's the true function. So the expectation over the training set is not relevant for this first term. And when I think about the expectation over y, well, what is this? This is the difference between my observation and the true function, and that, specifically, is epsilon. So this term here is epsilon squared. And epsilon has zero mean, so if I take the expectation of epsilon squared, that's just the variance of the noise from the world. That's sigma squared. Okay, so this first term results in sigma squared. $$ E_{train,y}[(y - f)^2] = E[\epsilon^2] = \sigma^2 $$

Now let's look at this second term. You know what, I'm going to write this a little bit differently to make it very clear: I'll just say that this first term here is sigma squared by definition. Okay, now let's look at this second term. And again, what is y minus f? Well, y minus f is this epsilon noise term, and our noise is a completely independent variable from f or $\hat f$.

  • If I take the expectation of a times b, where a and b are independent random variables, then the expectation of a times b is equal to the expectation of a times the expectation of b. So, this is another little aside. $$E[ab] = E[a]E[b] \text{, where a, b are independent variables.}$$

And so what I'll get here is that this term is the expectation of epsilon times the expectation of $f - \hat f$. And what's the expectation of epsilon, my noise? It's zero; remember, we said again and again that we're assuming epsilon is zero-mean noise, since any nonzero mean could be incorporated into f. So this term is zero, and the result of this whole thing is going to be zero. We can ignore that second term. $$E[(y - f)(f- \hat f)] \\ = E[\epsilon] E[f - \hat f] \\ = 0 \cdot E[f - \hat f] \\ = 0$$

Let's look at this last term, which for this slide I'm simply gonna call the mean squared error. I'm gonna use a little equals sign with a triangle on top to denote that this is a definition: I'm defining this to be equal to something called the mean squared error of $\hat f$ (let me write that out in case you want to look it up later). $$E[(f- \hat f)^2] = MSE(\hat f)$$

Now that I've gone through and done that, I can say that the result of all this derivation is that I get a quantity sigma squared plus the mean squared error of $\hat f$. $$\large E_{train,y}[(y-f)^2] +2E_{train,y}[(y - f)(f- \hat f)] + E_{train,y}[(f- \hat f)^2] \\ = \sigma^2 + MSE(\hat f)$$

But so far we've said a million times that my expected prediction error at $x_t$ is sigma squared plus bias squared plus variance. On the next slide what we're gonna do is we're gonna show how our mean squared error is exactly equal to bias squared plus variance.

Screenshot taken from Coursera 10:00

What I've done is I've started this slide by writing the mean squared error of $f_{\hat w(train)}(x_t)$, which, remember, on the previous slide we were calling $\hat f$; that was our shorthand notation.

$$MSE[f_{\hat w(train)}(x_t)] = \\ E_{train}[(f_{w(true)}(x_t) - f_{\hat w(train)}(x_t))^2]$$
  • $f_{\hat w(train)}(x_t) = \hat f$

And so the mean squared error of $\hat f$, according to the definition on the previous slide, is the expectation of $(f - \hat f)^2$. And I can mention here that when I take this expectation over the training data and my observation y, does the observation y appear anywhere in $f - \hat f$? No, so I can drop the y from the expectation. I'm repeating this here on the next slide, where I have the expectation over my training data of my true function (which on the last slide I had been denoting simply as f) minus the estimated function (which I had been denoting, inside this square, as $\hat f$), and both of these quantities are evaluated specifically at $x_t$.

$$= E_{train}[((f_{w(true)}(x_t) - f_{\bar w}(x_t)) + (f_{\bar w}(x_t) - f_{\hat w(train)}(x_t)))^2]$$

Again, let's go through expanding this, where in this case, to rewrite this quantity in a way that's gonna be useful for this derivation, we're gonna add and subtract $f_{\bar w}$ (f sub w bar). And remember that $f_{\bar w}$ was the green dashed line in all those bias-variance plots: for each possible training data set I get a specific fitted function, and I average all those fitted functions over those different training data sets. That's what results in $f_{\bar w}$. It's my average fit, for my specific model, averaged over my training data sets. And so for simplicity here, I'm gonna refer to $f_{\bar w}$ as $\bar f$.

  • $f_{\bar w} = \bar f$

Using that same trick of taking the expectation of A plus B squared and completing the square and then passing the expectation through, I'm going to do the same thing here $$= E_{train}[(f - \bar f)^2] + 2E_{train}[(f - \bar f)(\bar f - \hat f)] + E_{train}[(\bar f - \hat f)^2]$$

Now let's go through and talk about what each of these quantities is.

  • And the first thing is, let's just remember what the definition of $\bar f$ was formally: it was my expectation over training data sets of $\hat f$, my fitted function on a specific training data set. So I've already taken the expectation over the training set here. f is the true relationship; f has nothing to do with the training data. This is a number. And this is the mean of a random variable, so it no longer has to do with the training data set either; I've averaged over training data sets. So there's really no expectation over training data sets here; nothing is random in terms of the training data set for this first quantity. So $\bar f = E_{train}[\hat f]$

  • This first quantity is really simply $(f - \bar f)^2$, and what is that? That's the difference between the true function and my average, my expected fit. Specifically at $x_t$, but squared. That is bias squared. That's by definition. So $$E_{train}[(f - \bar f)^2] = (f - \bar f)^2 = bias^2(\hat f)$$

Now let's look at this second term. The factor $(f - \bar f)$ is not a function of the training data, so it's just like a scalar; it can come out of the expectation. So for this second term I can rewrite this as $$2E_{train}[(f - \bar f)(\bar f - \hat f)] \\ = 2(f- \bar f) E_{train}[\bar f - \hat f]$$

  • Okay. And now let's rewrite this term, and just pass the expectation through. The first thing is, again, $\bar f$ is not a function of the training data, so the result of that is just $\bar f$, and then I'm gonna get minus the expectation over my training data of $\hat f$. $$E_{train}[\bar f - \hat f] = \bar f - E_{train}[\hat f]$$
  • So, what is this $E_{train}[\hat f]$? This is the definition of f bar. This is taking my specific fit on a specific, so it's the fit on a specific training data set at $x_t$ And it's taking the expectation over all training data sets. That's exactly the definition of what f bar is, that average fit. $$E_{train}[\hat f] = \bar f$$

  • So, this term here is equal to 0 $$E_{train}[\bar f - \hat f] \\ = \bar f - E_{train}[\hat f] \\ = \bar f - \bar f \\ = 0$$

That just leaves one more quantity to analyze, and that's the last term here, where what I have is an expectation of a function minus its mean, squared. So, let me just write this in words. Note that I can equivalently write this as $(\hat f - \bar f)^2$; I hope it's clear that the sign flip there doesn't matter, because it gets squared, so they're exactly equivalent. And so what is this?

  • $\hat f$: this is a random function evaluated at $x_t$, which is just a random variable.
  • $\bar f$: and this is its mean.

And so taking the expectation of some random variable minus its mean, squared, is exactly the definition of variance. So, this term is the variance of $\hat f$. $$E_{train}[(\bar f - \hat f)^2] = E[(\hat f - \bar f)^2] = var(\hat f)$$

Screenshot taken from Coursera 19:00

That's exactly what we were hoping to show, because now we can put it all together. We derived that our expected prediction error at $x_t$ is equal to sigma squared plus mean squared error, and then we derived the fact that mean squared error is equal to bias squared plus variance. So we get the end result that our expected prediction error at $x_t$ is sigma squared plus bias squared plus variance, and this represents our three sources of error. And we've now completed our formal derivation of this.

Screenshot taken from Coursera 20:00

5) Putting the pieces together

1) Training/validation/test split for model selection, fitting, and assessment

Let's wrap up by talking about two really important tasks when you're doing regression. And this discussion is gonna motivate another important concept: thinking about validation sets.

So, the first of the two important tasks in regression is that we need to choose a specific model complexity. For example, when we're talking about polynomial regression, what's the degree of that polynomial? And then, for our selected model, we assess its performance. And actually these two steps aren't specific just to regression. We're gonna see this in all different aspects of machine learning, where we have to specify our model and then we need to assess the performance of that model. So, what we're gonna talk about in this portion of this module generalizes well beyond regression. And for this first task, where we're talking about choosing the specific model, we're gonna talk about it in terms of some set of tuning parameters, lambda, which control the model complexity. For example, lambda might specify the degree of the polynomial in polynomial regression.

Screenshot taken from Coursera 1:00

So, let's first talk about how we can think about choosing lambda. And then for a given model specified by lambda, a given model complexity, let's think about how we're gonna assess the performance of that model.

Well, one really naive approach is to do what we've described before, where you take your data set and split it into a training set and a test set. And then, here's what we're gonna do for our model selection portion, where we're choosing the model complexity lambda.

For every possible choice of lambda, we're gonna estimate the model parameters associated with that lambda on the training set. And then we're gonna test the performance of that fitted model on the test set. And we're gonna tabulate that for every lambda that we're considering, and choose our tuning parameters as the ones that minimize this test error, so the ones that perform best on the test data. We're gonna call those parameters lambda star.

So, now I have my model. I have my specific degree of polynomial that I'm gonna use. And I wanna go and assess the performance of this specific model. And the way I'm gonna do this is I'm gonna take my test data again. And I'm gonna say, well, okay, I know that test error is an approximation of generalization error. So, I'm just gonna compute the test error for this lambda star fitted model. And I'm gonna use that as my approximation of the performance of this model. Well, what's the issue with this? Is this gonna perform well? No, it's really overly optimistic.

Screenshot taken from Coursera 2:50

So, this issue is just like what we saw when we weren't dealing with this notion of choosing model complexity. We just assumed that we had a specific model, like a specific degree polynomial. But we wanted to assess the performance of the model. And the naive approach we took there was saying, well, we fit the model to the training data, and then we're gonna use training error to assess the performance of the model. And we said, that was overly optimistic because we were double dipping. We already used the data to fit our model. And then, so that error was not a good measure of how we're gonna perform on new data.

Well, it's exactly the same notion here, and let's walk through why. More specifically, when we're thinking about choosing our model complexity, we were using our test data to compare between different lambda values, and we chose the lambda value that minimized the error on that test data, the one that performed best there. So, you could think of this as having fit lambda, this model complexity tuning parameter, on the test data. And now we're thinking about using test error as a notion of approximating how well we'll do on new data. But the issue is, unless our test data represents everything we might see out there in the world, that's gonna be way too optimistic, because lambda, the model, was chosen to do well on the test data, and so it won't generalize well to new observations.

Screenshot taken from Coursera 4:00

So, what's our solution? Well, we can just create two test data sets. They won't both be called test sets, we're gonna call one of them a validation set. So, we're gonna take our entire data set, just to be clear. And now, we're gonna split it into three data sets.

One will be our training data set, one will be what we call our validation set, and the other will be our test set. And then what we're gonna do is, we're going to fit our model parameters always on our training data, for every given model complexity that we're considering. But then we're gonna select our model complexity as the model that performs best on the validation set has the lowest validation error. And then we're gonna assess the performance of that selected model on the test set. And we're gonna say that that test error is now an approximation of our generalization error. Because that test set was never used in either fitting our parameters, w hat, or selecting our model complexity lambda, that other tuning parameter. So, that data was completely held out, never touched, and it now forms a fair estimate of our generalization error.

Screenshot taken from Coursera 5:00

So in summary, we're gonna fit our model parameters for any given complexity on our training set. Then, for every fitted model and every model complexity, we're gonna assess the performance on our validation set and tabulate it. And we're gonna use that to select the optimal set of tuning parameters, lambda star. And then for that resulting model, that $\hat w$ sub lambda star, we're gonna assess a notion of the generalization error using our test set.
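A hedged end-to-end sketch of this workflow on synthetic data, using polynomial degree as the tuning parameter lambda (the data, split sizes, and degree range are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
f_true = lambda x: np.sin(2 * x)
x = rng.uniform(0, 3, size=200)
y = f_true(x) + rng.normal(0, 0.3, size=200)

# 80/10/10 split into training, validation, and test sets
idx = rng.permutation(200)
train_idx, val_idx, test_idx = idx[:160], idx[160:180], idx[180:]

best_degree, best_val_err = None, np.inf
for degree in range(1, 11):                                        # lambda = polynomial degree
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)    # fit w_hat on training data only
    val_err = np.mean((y[val_idx] - np.polyval(coeffs, x[val_idx])) ** 2)
    if val_err < best_val_err:                                     # select lambda* on the validation set
        best_degree, best_val_err = degree, val_err

# Assess the selected model once on the untouched test set
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=best_degree)
test_err = np.mean((y[test_idx] - np.polyval(coeffs, x[test_idx])) ** 2)
print(f"selected degree = {best_degree}, validation error = {best_val_err:.4f}, test error = {test_err:.4f}")
```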

Screenshot taken from Coursera 6:00

And so a question is, how can we think about doing the split between our training set, validation set, and test set? And there's no hard and fast rule here, there's no one right answer. But typical splits that you see out there are something like an 80-10-10 split: 80% of your data for training, 10% for validation, and 10% for test. Another common split is 50%, 25%, 25%. But again, this is assuming that you have enough data to do this type of split and still get reasonable estimates of your model parameters, reasonable notions of how different model complexities compare (because you have a large enough validation set), and still have a large enough test set to assess the generalization error of the resulting model. And if this isn't the case, we're gonna talk about other methods that give us these same types of notions, but without this type of hard division between training, validation, and test.

Screenshot taken from Coursera 7:00

2) A brief recap

Screenshot taken from Coursera 1:00